Exercises

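# Load the tidyverse (readr, dplyr, ggplot2, forcats), which is used throughout
library(tidyverse)

# working_directory_path is assumed to be defined earlier in the session
data_tibble <- read_delim(file = paste0(working_directory_path, "/data/final_data.csv"),
                          show_col_types = FALSE)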

data_tibble <- data_tibble %>% 
  filter(Classification_2020 != ".")

data_tibble <- data_tibble %>% 
  mutate(Classification_2020 = factor(Classification_2020, 
                                      levels = c("L", "LM", "UM", "H"), 
                                      labels = c("Low", "Lower-middle", "Upper-middle", "High"), 
                                      ordered = TRUE)
  )

data_tibble <- data_tibble %>%
  mutate(
    Continent = factor(Continent),
    Region = factor(Region)
  )

data_tibble <- data_tibble %>%
  mutate(
    Net_Migration_Rate = na_if(Net_Migration_Rate, ".") %>% as.double(),
    Median_Age = na_if(Median_Age, ".") %>% as.double(),
    Youth_Unemployment_Rate = na_if(Youth_Unemployment_Rate, ".") %>% as.double()
  )

head(data_tibble)
## # A tibble: 6 × 8
##   Net_Migration_Rate Median_Age Youth_Unemployment_R…¹ ISO   Classification_2020
##                <dbl>      <dbl>                  <dbl> <chr> <ord>              
## 1               27.1       23.5                   35.8 SYR   Low                
## 2               15.5       37.2                   NA   VGB   High               
## 3               13.3       39.5                   14.2 LUX   High               
## 4               13         40.5                   13.8 CYM   High               
## 5               11.8       35.6                    9.1 SGP   High               
## 6               10.6       32.9                    5.3 BHR   High               
## # ℹ abbreviated name: ¹​Youth_Unemployment_Rate
## # ℹ 3 more variables: Country <chr>, Region <fct>, Continent <fct>
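
Several plots in this document pass a vector named colorblind_palette to scale_fill_manual() and mosaicplot(), but its definition is not shown in this section. A minimal sketch of one possible definition, assuming four colours taken from the Okabe-Ito colourblind-safe palette (the specific choice of colours is an assumption, not the original definition):

```r
# Assumption: four Okabe-Ito colours, one per income level.
# The original definition of colorblind_palette is not shown in this document.
colorblind_palette <- c(
  "#E69F00",  # orange
  "#56B4E9",  # sky blue
  "#009E73",  # bluish green
  "#D55E00"   # vermillion
)
```

The vector is left unnamed on purpose: scale_fill_manual() matches a named vector against the factor levels, whereas an unnamed vector is simply assigned to the levels in order.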

a. Median age in different income levels

Using ggplot2, create a density plot of the median age grouped by income status groups. The densities for the different groups are superimposed in the same plot rather than in different plots. Ensure that you order the levels of the income status such that in the plots the legend is ordered from High (H) to Low (L).

  • The color of the density lines is black.
  • The area under the density curve should be colored differently among the income status levels.
  • For the colors, choose a transparency level of 0.5 for better visibility.
  • Position the legend at the top center of the plot and give it no title (hint: use element_blank()).
  • Rename the x axis as “Median age of population”

Comment briefly on the plot.

Answer

# Filter out non-finite values in the Median_Age column
filtered_data <- data_tibble %>%
  filter(is.finite(Median_Age))

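# Create the density plot; fct_rev() reverses the factor levels
# so that the legend runs from High to Low, as the task requires
ggplot(filtered_data, aes(x = Median_Age, fill = fct_rev(Classification_2020))) +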
  geom_density(alpha = 0.5, color = "black") +
  scale_fill_manual(values = colorblind_palette) +
  labs(x = "Median age of population") +
  theme_minimal() +
  theme(legend.position = "top",
        legend.title = element_blank(),
        legend.text = element_text(size = 10),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12))

The density plot shows that low-income countries have predominantly young populations, with the density peaking at a median age of around 20-25 years. Median age rises with income status: high-income countries have the oldest populations, with a peak around 40-45 years.

b. Income status in different continents

Investigate how the income status is distributed in the different continents.

  • Using ggplot2, create a stacked barplot of absolute frequencies showing how the entities are split into continents and income status. Comment on the plot.
  • Create another stacked barplot of relative frequencies (height of the bars should be one). Comment on the plot.
  • Create a mosaic plot of continents and income status using base R functions.
  • Briefly comment on the differences between the three plots generated to investigate the income distribution among the different continents.

Answer

# Create stacked barplot of absolute frequencies
ggplot(data_tibble, aes(x = Continent, fill = Classification_2020)) +
  geom_bar(position = "stack") +
  scale_fill_manual(values = colorblind_palette) +
  labs(x = "Continent", y = "Count", fill = "Income Status") +
  theme_minimal() +
  theme(legend.position = "top",
        legend.title = element_blank(),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12))

The absolute frequencies plot shows that Africa predominantly consists of low and lower-middle-income countries, reflecting its economic challenges, while Europe and North America are mostly high-income, indicating advanced economic development. Asia displays a diverse economic landscape with significant representation across all income categories. Oceania and South America show a mix of income levels, highlighting varying development within these regions.

# Create stacked barplot of relative frequencies
ggplot(data_tibble, aes(x = Continent, fill = Classification_2020)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = colorblind_palette) +
  labs(x = "Continent", y = "Proportion", fill = "Income Status") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  theme(legend.position = "top",
        legend.title = element_blank(),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12))

The relative frequencies plot makes it easier to compare the proportion of income statuses within each continent but obscures the actual number of countries. In contrast, the absolute frequencies plot clearly shows the total count of countries in each income status category, making it easier to see the magnitude but harder to compare proportions within continents. Thus, the relative plot is better for internal distribution comparison, while the absolute plot is better for understanding total counts.

# Create mosaic plot
mosaicplot(~ Continent + Classification_2020, 
           data = data_tibble, 
           color = colorblind_palette,
           main = "Mosaic Plot of Continents and Income Status",
           xlab = "Continent", 
           ylab = "Income Status")

The mosaic plot combines the strengths of the absolute and relative frequency plots. The width of each column is proportional to the number of entities in that continent, the height of each tile within a column reflects the share of that income status within the continent, and the area of each tile is therefore proportional to the absolute count of the continent and income status combination. This allows both the distribution within continents and the relative size of each category to be compared in a single plot.
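
The relation between the three plots can be made concrete with the underlying contingency table. The following is a minimal sketch with made-up toy data (the data frame toy and its values are purely illustrative and not taken from final_data.csv): the counts correspond to the stacked barplot, the row proportions to the filled barplot, and the mosaic tile area is the column width times the within-continent proportion.

```r
# Toy data (made up for illustration): five entities on two continents
toy <- data.frame(
  Continent = c("Africa", "Africa", "Africa", "Europe", "Europe"),
  Income    = c("Low", "Low", "High", "High", "High")
)

# Absolute frequencies: what the stacked barplot displays
counts <- table(toy$Continent, toy$Income)

# Relative frequencies within each continent: what the filled barplot displays
within_continent <- prop.table(counts, margin = 1)

# Share of each continent in the total: the column widths of a mosaic plot
continent_share <- prop.table(margin.table(counts, 1))

# Width times within-continent proportion gives the tile area of a mosaic plot
tile_area <- within_continent * as.vector(continent_share)
```

In this sketch tile_area equals prop.table(counts), i.e. each tile's area is the overall share of the corresponding cell.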

c. Income status in different subcontinents

For Asia, investigate further how income status is distributed across the different subcontinents. Use one of the plots in b. for this purpose. Comment on the results.

Answer

# Filter data to include only Asia
asia_data <- data_tibble %>%
  filter(Continent == "Asia")

# Create stacked barplot of absolute frequencies for Asian subcontinents
ggplot(asia_data, aes(x = Region, fill = Classification_2020)) +
  geom_bar(position = "stack") +
  scale_fill_manual(values = colorblind_palette) +
  labs(x = "Subcontinent", y = "Count", fill = "Income Status") +
  theme_minimal() +
  theme(legend.position = "top",
        legend.title = element_blank(),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12))

The absolute stacked barplot reveals that Western Asia has the highest diversity in income statuses, with significant representation in all categories, including a notable portion in the high-income category. Eastern Asia shows a strong presence of high-income and upper-middle-income countries, reflecting its economic development. Central Asia, South-eastern Asia, and Southern Asia are predominantly lower-middle-income regions, indicating more uniform economic status within these subcontinents.

d. Net migration in different continents

  • Using ggplot2, create parallel boxplots showing the distribution of the net migration rate in the different continents.
  • Prettify the plot (change y-, x-axis labels, etc).
  • Identify which country in Asia constitutes the largest negative outlier and which country in Asia constitutes the largest positive outlier.
  • Comment on the plot.

Answer

# Filter out non-finite values
filtered_data <- data_tibble %>%
  filter(is.finite(Net_Migration_Rate))

# Create parallel boxplots
ggplot(filtered_data, aes(x = Continent, y = Net_Migration_Rate)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 2) +
  labs(x = "Continent", y = "Net Migration Rate") +
  ggtitle("Distribution of Net Migration Rate by Continent") +
  theme_minimal() +
  theme(axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5))

# Create parallel boxplots again, 
# but hide the extreme negative outlier
ggplot(filtered_data, aes(x = Continent, y = Net_Migration_Rate)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 2) +
  labs(x = "Continent", y = "Net Migration Rate") +
  ggtitle("Distribution of Net Migration Rate by Continent\n(y-axis limited to [-30, 30])") +
  coord_cartesian(ylim = c(-30, 30)) +
  theme_minimal() +
  theme(axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5))

To make the plot easier to read, the second version restricts the y-axis to the range from -30 to 30 using coord_cartesian(). This hides Lebanon, whose extremely low net migration rate stretched the scale and made the distributions of the other continents hard to discern. Because coord_cartesian() only zooms in, Lebanon is still part of the data underlying the boxplot for Asia.

The zoomed boxplot shows the net migration rate distributions across continents with the extreme negative outlier out of view. Asia and Africa have the most variation, with many outliers, indicating diverse migration dynamics, which may reflect differing economic, political, and social conditions between countries. Europe, North America, and Oceania have more concentrated distributions with fewer outliers, which may reflect more stable migration trends. South America shows a narrower range and fewer extreme values, indicating more uniform migration trends within the continent.
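
Restricting the axis with coord_cartesian() is not the same as dropping observations. A minimal sketch with made-up toy data (the data frame toy is illustrative only) that compares zooming with setting scale limits, which removes out-of-range values before the boxplot statistics are computed:

```r
library(ggplot2)

# Toy data (made up): one group with a single extreme negative value
toy <- data.frame(group = "A", rate = c(-90, -5, -2, 0, 1, 3, 4))

# Zooming: all seven values still enter the boxplot statistics
p_zoom <- ggplot(toy, aes(x = group, y = rate)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(-30, 30))

# Scale limits: values outside [-30, 30] are dropped before the statistics
p_drop <- ggplot(toy, aes(x = group, y = rate)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(-30, 30))

# Extract the median that each boxplot layer actually draws
median_zoom <- layer_data(p_zoom)$middle
median_drop <- suppressWarnings(layer_data(p_drop)$middle)
```

Here median_zoom is 0 (the median of all seven values), while median_drop is 0.5 (the median of the six values that remain), so only the second approach changes the summary statistics.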

# Filter data for Asia and remove non-finite values
asia_data <- data_tibble %>%
  filter(Continent == "Asia" & is.finite(Net_Migration_Rate))

# Identify the largest negative and positive outliers in Asia
largest_negative_outlier <- asia_data %>%
  filter(Net_Migration_Rate == min(Net_Migration_Rate, na.rm = TRUE)) %>%
  select(Country, Net_Migration_Rate)

largest_positive_outlier <- asia_data %>%
  filter(Net_Migration_Rate == max(Net_Migration_Rate, na.rm = TRUE)) %>%
  select(Country, Net_Migration_Rate)

# Display the results
print(largest_negative_outlier)
## # A tibble: 1 × 2
##   Country Net_Migration_Rate
##   <chr>                <dbl>
## 1 Lebanon              -88.7
print(largest_positive_outlier)
## # A tibble: 1 × 2
##   Country Net_Migration_Rate
##   <chr>                <dbl>
## 1 Syria                 27.1
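
The code above takes the minimum and maximum, which coincide with the largest outliers only if those values are actually flagged as outliers. As a cross-check, the points that a boxplot draws as outliers lie more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule in base R, using a made-up toy vector (the values are illustrative, not the real Asian data):

```r
# Toy vector of rates (made up for illustration)
rates <- c(-88, -6, -3, -1, 0, 1, 2, 4, 27)

# Quartiles and interquartile range
q <- unname(quantile(rates, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: anything beyond them is drawn as an outlier
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

outliers <- rates[rates < lower_fence | rates > upper_fence]
```

In this toy vector both the minimum (-88) and the maximum (27) fall outside the fences, so they are the largest negative and positive outliers.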

e. Net migration in different subcontinents

The graph in d. clearly does not convey the whole picture. It would be interesting also to look at the subcontinents, as it is likely that a lot of migration flows happen within the continent.

  • Investigate the net migration in the different subcontinents, again using parallel boxplots. Group the boxplots by continent (hint: use facet_grid with scales = "free_x").
  • Remember to prettify the plot (rotate axis labels if needed).
  • Describe what you see.

Answer

# Filter out non-finite values
filtered_data <- data_tibble %>%
  filter(is.finite(Net_Migration_Rate))

# Create parallel boxplots grouped by continent
ggplot(filtered_data, aes(x = Region, y = Net_Migration_Rate)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 2) +
  labs(x = "Subcontinent", y = "Net Migration Rate") +
  ggtitle("Distribution of Net Migration Rate by Subcontinent") +
  coord_cartesian(ylim = c(-30, 30)) +
  facet_grid(. ~ Continent, scales = "free_x") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        strip.text.x = element_text(size = 12, face = "bold"))

When splitting continents into subcontinents, the detailed breakdown reveals significant regional variations within each continent that are not visible in the broader view. Subcontinent-specific migration patterns and outliers become apparent, such as the distinct differences between Central and Western Asia or among the subregions of Africa.

f. Median net migration rate per subcontinent.

The plot in task e. shows the distribution of the net migration rate for each subcontinent. Here you will work on visualizing only one summary statistic, namely the median. For each subcontinent, calculate the median net migration rate. Then create a plot which contains the sub-regions on the y-axis and the median net migration rate on the x-axis.

  • As geoms use points.
  • Color the points by continent and use a colorblind-friendly palette.
  • Rename the axes.
  • Using fct_reorder from the forcats package, arrange the levels of subcontinent such that in the plot the lowest (bottom) subcontinent has the lowest median net migration rate and the uppermost subcontinent has the highest median net migration rate.
  • Comment on the plot. E.g., what are the regions with the most influx? What are the regions with the most outflux?

Answer

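# Drop countries without a youth unemployment rate;
# note that this overwrites data_tibble
data_tibble <- data_tibble %>% filter(!is.na(Youth_Unemployment_Rate))
# Median net migration rate per subcontinent, ignoring missing rates
median_migration <- data_tibble %>%
  group_by(Region, Continent) %>%
  summarize(median_net_migration_rate = median(Net_Migration_Rate, na.rm = TRUE)) %>%
  ungroup()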
## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.
median_migration <- median_migration %>%
  mutate(Region = fct_reorder(Region, median_net_migration_rate, .na_rm = TRUE))

ggplot(median_migration, aes(x = median_net_migration_rate, y = Region, color = Continent)) +
  geom_point(size = 4, na.rm = TRUE) +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "Median Net Migration Rate",
    y = "Subcontinent",
    title = "Median Net Migration Rate by Subcontinent",
    color = "Continent"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 10))

The plot shows that Central Asia, Polynesia and Micronesia have the largest outflux, with the latter island regions even more separated from the pack. Most African regions show a roughly balanced net migration rate. North America, large parts of Europe and Australia are the regions with the largest influx, with Australia heavily dominating this statistic. Given the geographic proximity of Australia to Micronesia and Polynesia, it seems plausible that a large share of the emigrants from these islands move to Australia, which would help explain the median net migration rates of these regions.
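
The reordering step in the answer above relies on the fact that ggplot2 places the first factor level at the bottom of a discrete y-axis. A minimal sketch of what fct_reorder does to the level order, using a made-up toy table (the region names and values are illustrative only):

```r
library(forcats)

# Toy summary table (made-up values), one row per region
toy <- data.frame(
  Region = factor(c("North", "South", "East")),
  median_rate = c(2.5, -4, 0.5)
)

# Factor levels are alphabetical by default
levels_before <- levels(toy$Region)

# Reorder the levels by the summary value, lowest first
toy$Region <- fct_reorder(toy$Region, toy$median_rate)
levels_after <- levels(toy$Region)
```

After reordering, the level with the lowest value comes first and would therefore be drawn at the bottom of the plot.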

g. Median youth unemployment rate per subcontinent

For each subcontinent, calculate the median youth unemployment rate. Then create a plot which contains the sub-regions on the y-axis and the median unemployment rate on the x-axis.

  • Use a black and white theme (?theme_bw())
  • As geoms use bars. (hint: pay attention to the statistical transformation taking place in geom_bar() and look into the argument stat = "identity")
  • Color the bars by continent – use a colorblind friendly palette.
  • Make the bars transparent (use alpha = 0.7).
  • Rename the axes.
  • Using fct_reorder from the forcats package, arrange the levels of subcontinent such that in the plot the lowest (bottom) subcontinent has the lowest median youth unemployment rate and the uppermost subcontinent has the highest median youth unemployment rate.
  • Comment on the plot. E.g., what are the regions with the highest vs lowest youth unemployment rate?

Answer

median_unemployment <- data_tibble %>%
  group_by(Region, Continent) %>%
  summarize(median_youth_unemployment_rate = median(Youth_Unemployment_Rate)) %>%
  ungroup()
## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.
median_unemployment <- median_unemployment %>%
  mutate(region = fct_reorder(Region, median_youth_unemployment_rate))

ggplot(median_unemployment, aes(x = median_youth_unemployment_rate, y = region, fill = Continent)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    x = "Median Youth Unemployment Rate",
    y = "Subcontinent",
    title = "Median Youth Unemployment Rate by Subcontinent",
    fill = "Continent"
  ) +
  theme_bw() +
  theme(axis.text.y = element_text(size = 10))

Youth unemployment is most prevalent in Southern and Northern Africa, but also in Polynesia and Southern Europe. In general, Asia and Europe score rather low on youth unemployment, and on almost all continents the southern regions tend to perform worse. Interestingly, even though most regions of Africa have rather high values, Western Africa is among the top performers regarding youth unemployment.

h. Median youth unemployment rate per subcontinent – with error bars

The value displayed in the barplot in g. is the result of an aggregation, so it might be useful to also plot error bars to get a general idea of how precise the median unemployment rate is. This can be achieved by plotting error bars that reflect the standard deviation or the interquartile range of the variable in each of the subcontinents.

Repeat the plot in g. but also include error bars that reflect the 25% and 75% quantiles. You can use geom_errorbar in ggplot2.

Answer

quantiles_unemployment <- data_tibble %>%
  group_by(Region, Continent) %>%
  summarize(
    median_youth_unemployment_rate = median(Youth_Unemployment_Rate),
    q25 = quantile(Youth_Unemployment_Rate, 0.25),
    q75 = quantile(Youth_Unemployment_Rate, 0.75)
  ) %>%
  ungroup()
## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.
quantiles_unemployment <- quantiles_unemployment %>%
  mutate(region = fct_reorder(Region, median_youth_unemployment_rate))

ggplot(quantiles_unemployment, aes(x = median_youth_unemployment_rate, y = region, fill = Continent)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  geom_errorbar(aes(xmin = q25, xmax = q75), width = 0.2) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    x = "Median Youth Unemployment Rate",
    y = "Subcontinent",
    title = "Median Youth Unemployment Rate by Subcontinent with Error Bars",
    fill = "Continent"
  ) +
  theme_bw() +
  theme(axis.text.y = element_text(size = 10))
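
The error bars carry information that the median alone does not. A minimal sketch in base R with made-up toy data (the regions A and B and their rates are illustrative only): both regions have the same median, but their quartiles differ, so the error bars would have different lengths.

```r
# Toy data (made up): two regions with the same median but different spread
toy <- data.frame(
  region = rep(c("A", "B"), each = 5),
  rate   = c(10, 14, 15, 16, 20,
              5, 10, 15, 20, 40)
)

# 25%, 50% and 75% quantiles per region, one row per region
quartiles <- do.call(
  rbind,
  lapply(split(toy$rate, toy$region), quantile, probs = c(0.25, 0.5, 0.75))
)

# Length of the error bar (interquartile range) per region
bar_length <- quartiles[, "75%"] - quartiles[, "25%"]
```

Both regions have a median of 15, but the interquartile range is 2 for region A and 10 for region B.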

i. Relationship between median age and net migration rate

Using ggplot2, create a plot showing the relationship between median age and net migration rate.

  • Color the geoms based on the income status.
  • Add a regression line for each income status (using geom_smooth()).

Comment on the plot. Do you see any relationship between the two variables? Do you see any difference among the income levels?

Answer

ggplot(data_tibble, aes(x = Median_Age, y = Net_Migration_Rate, color = Classification_2020)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Median Age",
    y = "Net Migration Rate",
    title = "Relationship between Median Age and Net Migration Rate",
    color = "Income Status"
  ) +
  theme_minimal() +
  theme(axis.text = element_text(size = 10))
## `geom_smooth()` using formula = 'y ~ x'

The most eye-catching regression line is that of the low-income group. However, this line is clearly heavily influenced by an outlier at the top of the graph, so a robust regression method could be beneficial here. All other income status classes behave similarly, with their regression lines being almost parallel.
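
To illustrate the point about robust regression, the following sketch compares an ordinary least squares fit with a robust fit from MASS::rlm on made-up toy data (the vectors x and y are illustrative, with a true slope of 2 and one artificial outlier). MASS is called with the :: prefix rather than attached, since attaching it after dplyr would mask select().

```r
# Toy data (made up): a linear trend with slope 2 plus small deterministic noise
x <- 1:20
y <- 2 * x + sin(x)
y[3] <- 100   # one artificial outlier

# Ordinary least squares is pulled towards the outlier
fit_ols <- lm(y ~ x)

# Robust regression (Huber M-estimation) downweights the outlier
fit_robust <- MASS::rlm(y ~ x)

slope_ols    <- unname(coef(fit_ols)["x"])
slope_robust <- unname(coef(fit_robust)["x"])
```

In this sketch the robust slope stays much closer to 2 than the least squares slope. In the plot above, the same idea could be tried by passing the function to the smoother, e.g. geom_smooth(method = MASS::rlm, se = FALSE), although whether this changes the conclusion for the real data is not verified here.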

j. Relationship between youth unemployment and net migration rate

Create a plot as in Task i. but for youth unemployment and net migration rate. Comment briefly.

Answer

ggplot(data_tibble, aes(x = Youth_Unemployment_Rate, y = Net_Migration_Rate, color = Classification_2020)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Youth Unemployment Rate",
    y = "Net Migration Rate",
    title = "Relationship between Youth Unemployment Rate and Net Migration Rate",
    color = "Income Status"
  ) +
  theme_minimal() +
  theme(axis.text = element_text(size = 10))
## `geom_smooth()` using formula = 'y ~ x'

Here, different trends are visible for the low income status group compared with the other ones. For the low income group, a higher youth unemployment rate correlates with a higher net migration rate. The other income status groups tend to the opposite, with higher youth unemployment rates correlating with lower net migration rates.

k. Merging population data

Go online and find a data set which contains the 2020 population for the countries of the world together with ISO codes.

  • Download this data and merge it to the dataset you are working on in this case study using a left join. (A possible source: World Bank)
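  • Inspect the data and check whether the join worked well.

Answer

# Read the population data and keep only the ISO code and the 2020 population
pop_data <- read.csv("../data/country_population_data.csv")
pop_data <- pop_data[c("Country.Code", "X2020")]
colnames(pop_data) <- c("ISO", "Population_2020")
# A left join keeps every row of data_tibble; the shared key is the ISO column
full_data <- left_join(data_tibble, pop_data)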
## Joining with `by = join_by(ISO)`
head(full_data)
## # A tibble: 6 × 9
##   Net_Migration_Rate Median_Age Youth_Unemployment_R…¹ ISO   Classification_2020
##                <dbl>      <dbl>                  <dbl> <chr> <ord>              
## 1               27.1       23.5                   35.8 SYR   Low                
## 2               15.5       37.2                   NA   VGB   High               
## 3               13.3       39.5                   14.2 LUX   High               
## 4               13         40.5                   13.8 CYM   High               
## 5               11.8       35.6                    9.1 SGP   High               
## 6               10.6       32.9                    5.3 BHR   High               
## # ℹ abbreviated name: ¹​Youth_Unemployment_Rate
## # ℹ 4 more variables: Country <chr>, Region <fct>, Continent <fct>,
## #   Population_2020 <dbl>

For the most part the join worked well. I used left_join so that countries that appear in the population data but not in the original data are ignored (otherwise they would produce rows in which most values are NA). Only two problems occurred during the merge. For Kosovo, the original dataset uses XKS as the ISO code while the population data uses XKX; this was easily corrected by changing the country code in the population data. For Taiwan, no population figure is available on the World Bank site, since Taiwan is counted as part of China there, so its population is missing after the join.
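
The check described above can be made reproducible with anti_join, which lists the rows of the left table that find no partner in the right table. A minimal sketch with made-up stand-in tables (main_data, pop_toy and all numeric values are hypothetical; only the ISO codes XKS, XKX and TWN come from the text):

```r
library(dplyr)

# Toy stand-ins for the two tables (numeric values are placeholders)
main_data <- data.frame(ISO = c("LUX", "XKS", "TWN"),
                        Median_Age = c(1, 2, 3))
pop_toy   <- data.frame(ISO = c("LUX", "XKX", "USA"),
                        Population_2020 = c(100, 200, 300))

# Rows of the main data without a match in the population data
unmatched_before <- anti_join(main_data, pop_toy, by = "ISO")

# Recode Kosovo in the population data so that the codes agree
pop_fixed <- pop_toy %>%
  mutate(ISO = if_else(ISO == "XKX", "XKS", ISO))

# Join again and re-check which rows remain unmatched
merged <- left_join(main_data, pop_fixed, by = "ISO")
unmatched_after <- anti_join(main_data, pop_fixed, by = "ISO")
```

Before the recoding, XKS and TWN are unmatched; afterwards only TWN remains, and its Population_2020 is NA in the merged table. The extra country in the population table (USA here) is dropped by the left join.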

l. Scatterplot of median age and net migration rate in Europe

Make a scatterplot of median age and net migration rate for the countries of Europe. Scale the size of the points according to each country’s population.

  • For better visibility, use a transparency of alpha=0.7.
  • Remove the legend.
  • Comment on the plot.

Answer

scatterplot_data <- full_data %>% filter(Continent == "Europe")
# alpha is a fixed setting, so it belongs in geom_point() rather than in aes()
ggplot(scatterplot_data,
       aes(x = Median_Age, y = Net_Migration_Rate, size = Population_2020, color = Country)) +
  geom_point(alpha = 0.7) +
  theme(legend.position = "none") +
  ggtitle("Median age vs. Migration Rate in Europe")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

The median age of most European countries is above 40 years. There does not seem to be any visible relationship between the net migration rate and the median age: most of the data points are concentrated in a “blob” in the middle.

m. Interactive plot

On the merged data set from Task k., using function ggplotly from package plotly re-create the scatterplot in Task l., but this time for all countries. Color the points according to their continent.

When hovering over a point, the name of the country and the values for median age, net migration rate, and population should be shown. (Hint: use the aesthetic text = Country. In ggplotly, use the argument tooltip = c("text", "x", "y", "size").)

Answer

library(plotly)
# text = Country is only used for the tooltip; alpha is set as a fixed value
p <- ggplot(full_data,
            aes(x = Median_Age, y = Net_Migration_Rate, size = Population_2020,
                color = Continent, text = Country)) +
  geom_point(alpha = 0.7) +
  theme(legend.position = "none") +
  ggtitle("Median age vs. Migration Rate, Worldwide")
pltly <- ggplotly(p, tooltip = c("text", "x", "y", "size"))
pltly

n. Parallel coordinate plot

In parallel coordinate plots each observation or data point is depicted as a line traversing a series of parallel axes, corresponding to a specific variable or dimension. It is often used for identifying clusters in the data.

One can create such a plot using the GGally R package. You should create such a plot where you look at the three main variables in the data set: median age, youth unemployment rate and net migration rate. Color the lines based on the income status. Briefly comment.

Answer

library(GGally)
# Columns 1 to 3 are Net_Migration_Rate, Median_Age and Youth_Unemployment_Rate
ggparcoord(data = full_data, columns = 1:3, groupColumn = "Classification_2020",
           title = "Income-Class to Data Comparison, Worldwide")

The net migration rate and the median age seem to increase with the income class of a country, while the youth unemployment rate is spread more evenly across the classes (although most high-income countries still seem to have a fairly low youth unemployment rate).

o. World map visualisation

Using the package rworldmap, create a world map of the median age per country. Use the vignette to find how to do this in R.

Answer

library(rworldmap)
# Match the countries to the map polygons via their ISO3 codes
cdatamap <- joinCountryData2Map(full_data, joinCode = "ISO3", nameJoinColumn = "ISO")
## 215 codes from your data successfully matched countries in the map
## 3 codes from your data failed to match with a country code in the map
## 29 codes from the map weren't represented in your data
par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i")
mapCountryData(cdatamap, nameColumnToPlot = "Median_Age")